library(dplyr)
library(tidyverse)
library(stringr)
library(ggplot2)
To find out how much the 4Ps of marketing affect game ownership on Steam, three main questions were analyzed: what does the Steam market look like, does genre affect game ownership, and do price expectations differ between genres?
The data set, sourced from Kaggle, was first cleaned using the dplyr and tidyverse packages. Because Steam is a distribution platform for games rather than a curated selection, place was the first of the 4Ps to be considered. The majority of games have very few players compared to the most popular games, so the ownership data is heavily right-skewed, resembling a decaying exponential. To gain visibility on Steam, games therefore need to be marketed, and marketing begins with the product. The next step was to narrow the data to the top 250 games and find the genre that appeared most often in that subset, which was the Action genre. Afterwards, the focus shifted to looking at Steam from the perspective of a developer looking to distribute a Simulation game.
After analyzing the results, it was found that game ownership is a useful metric for a game's success because several other measures, such as review counts and total playtime, correlate with estimated ownership. The analysis shows that Simulation games tend to attract fewer players than Action games. Additionally, games in the Simulation genre are typically priced higher than the most popular games, which may suggest that such games are typically more complex than games tagged only as Action. Taken together, the data suggests that genre, or product in marketing terms, affects both the price and how well a game may be received on Steam.
Steam is one of the largest online marketplaces for PC games, offering over 100,000 games on the platform. Because Steam's catalogue is so large, low ownership of a game is not necessarily indicative of its quality or enjoyment; the game may simply have been buried. The purpose of this project is to see how games perform on Steam by analyzing the platform itself, examining whether the product (genre) affects ownership, and then seeing whether price expectations change between genres. What follows is an organized approach to the analysis:
Using this data, the hope is to create a presentation, find inspiration, and, most importantly, present the data in a way that is readable and understandable. With the help of RStudio, the goal is to evaluate whether understanding the 4Ps of marketing can help developers increase ownership of their game on Steam. To do this, Steam’s marketplace will be analyzed from the perspective of a developer looking to publish a Simulation game, by analyzing the top genres and popular games and discovering how much place, product, and price matter on the platform.
| Variable Name | What the Variable Represents | Numerical or Categorical? |
|---|---|---|
| AppID | A unique ID that enables Steam to differentiate one game from another regardless of its name | Numerical |
| Name | The name of the game | Categorical |
| Estimated.owners | The estimated number of people who own the game. Gives a range | Categorical |
| Release.date | The date the game released | Categorical |
| Price | The base price of the game | Numerical |
| Supported.languages | Details the languages that are supported by the game | Categorical |
| User.score | Steam’s way of summarizing player recommendations. It is a Percentage calculated by finding the number of positive reviews then dividing by total reviews | Numerical |
| Positive | Number of positive reviews | Numerical |
| Negative | Number of negative reviews | Numerical |
| Achievements | Number of achievements available in the game | Numerical |
| Recommendations | Number of recommendations | Numerical |
| Average.playtime.two.weeks | Average playtime over the last two weeks, measured in hours | Numerical |
| Average.playtime.forever | Average total playtime | Numerical |
| Median.playtime.two.weeks | Median playtime over two weeks (less susceptible to outliers) | Numerical |
| Median.playtime.forever | Median total playtime | Numerical |
| Developers | The developers of the game | Categorical |
| Publishers | The publishers of the game | Categorical |
| Categories | Denotes what kind of game it is (online, single-player, etc.) | Categorical |
| Genres | Denotes the multiple genres of the game | Categorical |
| Tags | More specific than ‘Genres’ (clicker, agriculture, sandbox) | Categorical |
The data was scraped from Steam, so many values have odd formatting. Originally, this was believed to be a problem with how the data was gathered. After some research, however, it was found that the symbols in the “Name” column are Unicode characters and that the text is perfectly decipherable. They appear when a name uses characters from other languages or special symbols: Unicode is the computer’s way of representing characters from almost all of the world’s written languages, and such characters show up as odd symbols (for example, “â„¢” in place of “™”) when the file is read with a different text encoding. In essence, the symbols present in the data are not a result of bad scraping. A few NA values were also found, but these have little effect on the calculations and have posed few problems outside of the user score correlation analysis. That column showed a large number of NA values; however, this would only be a problem if the analysis went outside of the project scope.
The data is rather tidy, with the exception of the Genres column, which contains a list of tagged genres rather than a single value. I decided to leave the Genres column alone because much of the analysis relies on that single column. If the values were separated into columns, the number of columns per row would differ because Steam allows multiple genre tags. To keep the data together, the grepl() function will be a big help when the focus is narrowed to specific genres within the column. The function lets R find specific strings within a column without splitting it; the Genres column will only be split to find the most common genres on Steam.
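As a minimal sketch of that approach, the snippet below uses a small made-up data frame (the `toy_games` rows are illustrative, not taken from the data set) to show how `grepl()` keeps every game whose comma-separated `Genres` string contains a given tag, without splitting the column. Passing `fixed = TRUE` is an assumption made here so the tag is matched literally rather than as a regular expression.

```r
# Hypothetical rows for illustration only; the real data has about 110k games
toy_games <- data.frame(
  Name   = c("Game A", "Game B", "Game C"),
  Genres = c("Casual,Indie,Simulation", "Action,Indie", "Simulation,Strategy"),
  stringsAsFactors = FALSE
)

# TRUE where the Genres string contains the tag "Simulation"
is_sim <- grepl("Simulation", toy_games$Genres, fixed = TRUE)

# Keep the matching rows; multi-genre games stay intact as single rows
toy_sims <- toy_games[is_sim, ]
print(toy_sims$Name)
```

Because the match is a substring test, a game tagged with several genres is returned once, with all of its tags still in place.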
Though the data is very consistent, there are a few difficulties in places where text is formatted, particularly for games with special characters or formatting not available on the typical English keyboard. This is only a minor inconvenience because the text can still be decoded. Additionally, a few columns could be better utilized if they held numerical values. The main offender is the Estimated.owners column. To fix this, the values will be separated into a minimum and a maximum, and the average of the two will be stored in a new column, which most of the analysis will use.
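The conversion described above can be sketched in base R on a few range strings in the same "min - max" format as Estimated.owners. The vector `owner_ranges` and the intermediate names are illustrative assumptions, and this is only one possible way to do the split, not necessarily the one used in the analysis.

```r
# Example range strings in the "min - max" format of Estimated.owners
owner_ranges <- c("0 - 20000", "50000 - 100000", "20000 - 50000")

# Split each string on the hyphen and trim the surrounding spaces
parts <- strsplit(owner_ranges, "-", fixed = TRUE)
min_owners <- as.numeric(trimws(vapply(parts, `[`, character(1), 1)))
max_owners <- as.numeric(trimws(vapply(parts, `[`, character(1), 2)))

# The midpoint of each range serves as a single numeric estimate
owners_avg <- (min_owners + max_owners) / 2
print(owners_avg)
```

A midpoint is a simplification: it treats every game in a bracket as equally likely to sit anywhere in the range, which is worth keeping in mind when reading later charts.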
#4a. Read in data
OG_Steam <- read.csv("Steam_Games.csv", fill = TRUE)
OG_Steam
| AppID <int> |
|---|
| 20200 |
| 655370 |
| 1732930 |
| 1355720 |
| 1139950 |
| 1469160 |
| 1659180 |
| 1968760 |
| 1178150 |
| 320150 |
#4b.1 Show a summary of the data and the first few observations
#Summary
summary(OG_Steam)
AppID Name Release.date Estimated.owners Peak.CCU Required.age Price
Min. : 10 Length:109780 Length:109780 Length:109780 Min. : 0.0 Min. : 0.0000 Min. : 0.000
1st Qu.: 925058 Class :character Class :character Class :character 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.990
Median :1642805 Mode :character Mode :character Mode :character Median : 0.0 Median : 0.0000 Median : 3.990
Mean :1705773 Mean : 178.7 Mean : 0.2569 Mean : 7.041
3rd Qu.:2430293 3rd Qu.: 1.0 3rd Qu.: 0.0000 3rd Qu.: 9.990
Max. :3671840 Max. :1311366.0 Max. :21.0000 Max. : 999.980
About.the.game Supported.languages Full.audio.languages Reviews Header.image Website
Length:109780 Length:109780 Length:109780 Length:109780 Length:109780 Length:109780
Class :character Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
Support.url Support.email Windows Mac Linux Metacritic.score Metacritic.url
Length:109780 Length:109780 Mode :logical Mode :logical Mode :logical Min. : 0.000 Length:109780
Class :character Class :character FALSE:33 FALSE:90621 FALSE:96265 1st Qu.: 0.000 Class :character
Mode :character Mode :character TRUE :109747 TRUE :19159 TRUE :13515 Median : 0.000 Mode :character
Mean : 2.658
3rd Qu.: 0.000
Max. :97.000
User.score Positive Negative Score.rank Achievements Recommendations
Min. : 0.00000 Min. : 0.0 Min. : 0.0 Min. : 97.00 Min. : 0.00 Min. : 0.0
1st Qu.: 0.00000 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 98.00 1st Qu.: 0.00 1st Qu.: 0.0
Median : 0.00000 Median : 4.0 Median : 1.0 Median : 99.00 Median : 0.00 Median : 0.0
Mean : 0.03087 Mean : 764.9 Mean : 127.2 Mean : 98.91 Mean : 17.64 Mean : 624.5
3rd Qu.: 0.00000 3rd Qu.: 29.0 3rd Qu.: 8.0 3rd Qu.:100.00 3rd Qu.: 17.00 3rd Qu.: 0.0
Max. :100.00000 Max. :5764420.0 Max. :895978.0 Max. :100.00 Max. :9821.00 Max. :3441592.0
NA's :109736
Notes Average.playtime.forever Average.playtime.two.weeks Median.playtime.forever Median.playtime.two.weeks
Length:109780 Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.000
Class :character 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
Mode :character Median : 0.00 Median : 0.000 Median : 0.00 Median : 0.000
Mean : 82.29 Mean : 9.213 Mean : 73.59 Mean : 9.909
3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000
Max. :145727.00 Max. :19159.000 Max. :208473.00 Max. :19159.000
Developers Publishers Categories Genres Tags
Length:109780 Length:109780 Length:109780 Length:109780 Length:109780
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
# This needed to be shown in a new window
#4b.2 First few observations
head(OG_Steam)
| | AppID <int> | Name <chr> | Release.date <chr> | Estimated.owners <chr> | Peak.CCU <int> | Required.age <int> | Price <dbl> |
|---|---|---|---|---|---|---|---|
| 1 | 20200 | Galactic Bowling | 21-Oct-08 | 0 - 20000 | 0 | 0 | 19.99 |
| 2 | 655370 | Train Bandit | 12-Oct-17 | 0 - 20000 | 0 | 0 | 0.99 |
| 3 | 1732930 | Jolt Project | 17-Nov-21 | 0 - 20000 | 0 | 0 | 4.99 |
| 4 | 1355720 | Henosisâ„¢ | 23-Jul-20 | 0 - 20000 | 0 | 0 | 5.99 |
| 5 | 1139950 | Two Weeks in Painland | 3-Feb-20 | 0 - 20000 | 0 | 0 | 0.00 |
| 6 | 1469160 | Wartune Reborn | 26-Feb-21 | 50000 - 100000 | 68 | 0 | 0.00 |
#4c. Create a new data set from the variables that are of interest
Steam_Games_Pre <- OG_Steam %>%
select(AppID, Name, Release.date, Estimated.owners, Price, Supported.languages, User.score, Positive, Negative, Achievements, Recommendations, Average.playtime.two.weeks, Average.playtime.forever, Median.playtime.two.weeks, Median.playtime.forever, Developers, Publishers, Categories, Genres, Tags)
#Example of how this dataset may be used
Publisher_Price_Summary <- Steam_Games_Pre%>%
group_by(Publishers) %>%
summarise(Avg_Price = mean(Price)) %>%
arrange(-Avg_Price)
Publisher_Price_Summary
| Publishers <chr> | Avg_Price <dbl> |
|---|---|
| A&S Inc. | 999.98000 |
| Fury Games | 999.00000 |
| Whoes heart broken | 500.00000 |
| SideFX | 269.99000 |
| 3Dflow SRL | 199.99000 |
| AT_Games | 199.99000 |
| AssetFlipGames World Game Publishing,ALFINA WORLD GAME PUBLISHING,rocketship | 199.99000 |
| BestEntrepeneurs | 199.99000 |
| CatCat Gaming | 199.99000 |
| Colyu | 199.99000 |
summary(Steam_Games_Pre)
AppID Name Release.date Estimated.owners Price Supported.languages
Min. : 10 Length:109780 Length:109780 Length:109780 Min. : 0.000 Length:109780
1st Qu.: 925058 Class :character Class :character Class :character 1st Qu.: 0.990 Class :character
Median :1642805 Mode :character Mode :character Mode :character Median : 3.990 Mode :character
Mean :1705773 Mean : 7.041
3rd Qu.:2430293 3rd Qu.: 9.990
Max. :3671840 Max. : 999.980
User.score Positive Negative Achievements Recommendations Average.playtime.two.weeks
Min. : 0.00000 Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.000
1st Qu.: 0.00000 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.000
Median : 0.00000 Median : 4.0 Median : 1.0 Median : 0.00 Median : 0.0 Median : 0.000
Mean : 0.03087 Mean : 764.9 Mean : 127.2 Mean : 17.64 Mean : 624.5 Mean : 9.213
3rd Qu.: 0.00000 3rd Qu.: 29.0 3rd Qu.: 8.0 3rd Qu.: 17.00 3rd Qu.: 0.0 3rd Qu.: 0.000
Max. :100.00000 Max. :5764420.0 Max. :895978.0 Max. :9821.00 Max. :3441592.0 Max. :19159.000
Average.playtime.forever Median.playtime.two.weeks Median.playtime.forever Developers Publishers Categories
Min. : 0.00 Min. : 0.000 Min. : 0.00 Length:109780 Length:109780 Length:109780
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 Class :character Class :character Class :character
Median : 0.00 Median : 0.000 Median : 0.00 Mode :character Mode :character Mode :character
Mean : 82.29 Mean : 9.909 Mean : 73.59
3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00
Max. :145727.00 Max. :19159.000 Max. :208473.00
Genres Tags
Length:109780 Length:109780
Class :character Class :character
Mode :character Mode :character
#4d. Name the new dataset and write it to a csv file
write.csv(Steam_Games_Pre, "Steam_Games_Cleaned.csv")
# Read in Data
Steam_Games <- read.csv("Steam_Games_Cleaned.csv")
Steam_Games
| X <int> | AppID <int> |
|---|---|
| 1 | 20200 |
| 2 | 655370 |
| 3 | 1732930 |
| 4 | 1355720 |
| 5 | 1139950 |
| 6 | 1469160 |
| 7 | 1659180 |
| 8 | 1968760 |
| 9 | 1178150 |
| 10 | 320150 |
# Get rid of 'X' column from Steam_Games_Cleaned
Steam_Games$X <- NULL
# Find the column names
colnames(Steam_Games)
[1] "AppID" "Name" "Release.date" "Estimated.owners"
[5] "Price" "Supported.languages" "User.score" "Positive"
[9] "Negative" "Achievements" "Recommendations" "Average.playtime.two.weeks"
[13] "Average.playtime.forever" "Median.playtime.two.weeks" "Median.playtime.forever" "Developers"
[17] "Publishers" "Categories" "Genres" "Tags"
# Summary Statistics
summary(Steam_Games_Pre)
AppID Name Release.date Estimated.owners Price Supported.languages
Min. : 10 Length:109780 Length:109780 Length:109780 Min. : 0.000 Length:109780
1st Qu.: 925058 Class :character Class :character Class :character 1st Qu.: 0.990 Class :character
Median :1642805 Mode :character Mode :character Mode :character Median : 3.990 Mode :character
Mean :1705773 Mean : 7.041
3rd Qu.:2430293 3rd Qu.: 9.990
Max. :3671840 Max. : 999.980
User.score Positive Negative Achievements Recommendations Average.playtime.two.weeks
Min. : 0.00000 Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.000
1st Qu.: 0.00000 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.000
Median : 0.00000 Median : 4.0 Median : 1.0 Median : 0.00 Median : 0.0 Median : 0.000
Mean : 0.03087 Mean : 764.9 Mean : 127.2 Mean : 17.64 Mean : 624.5 Mean : 9.213
3rd Qu.: 0.00000 3rd Qu.: 29.0 3rd Qu.: 8.0 3rd Qu.: 17.00 3rd Qu.: 0.0 3rd Qu.: 0.000
Max. :100.00000 Max. :5764420.0 Max. :895978.0 Max. :9821.00 Max. :3441592.0 Max. :19159.000
Average.playtime.forever Median.playtime.two.weeks Median.playtime.forever Developers Publishers Categories
Min. : 0.00 Min. : 0.000 Min. : 0.00 Length:109780 Length:109780 Length:109780
1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 Class :character Class :character Class :character
Median : 0.00 Median : 0.000 Median : 0.00 Mode :character Mode :character Mode :character
Mean : 82.29 Mean : 9.909 Mean : 73.59
3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00
Max. :145727.00 Max. :19159.000 Max. :208473.00
Genres Tags
Length:109780 Length:109780
Class :character Class :character
Mode :character Mode :character
Currently, the Estimated.owners column is stored as character data, so it will need to be converted to a numeric data type.
# Use dim (outcome is 109780 rows and 20 columns)
dim(Steam_Games)
[1] 109780 20
There are 109780 rows in this data set and 20 columns.
# Update data set so min and max estimated owners have their own columns
Steam_Games <- Steam_Games %>%
  separate(
    col = Estimated.owners,
    into = c("Min.estimated.owners", "Max.estimated.owners"),
    sep = "-"
  )
# Convert columns to numeric values
Steam_Games$Max.estimated.owners <- as.numeric(Steam_Games$Max.estimated.owners)
Steam_Games$Min.estimated.owners <- as.numeric(Steam_Games$Min.estimated.owners)
# Take the midpoint of the range as the average estimated owners (used as the most likely value)
Steam_Games <- Steam_Games %>%
  mutate(Owners.avg = (Min.estimated.owners + Max.estimated.owners) / 2)
head(Steam_Games$Owners.avg, 15)
 [1] 10000 10000 10000 10000 10000 75000 10000 10000 10000 75000 35000 75000 10000 35000 10000
ggplot(data = Steam_Games, aes(x = Owners.avg)) +
geom_bar(fill = "cadetblue3", color = "cadetblue") +
labs(
title = "Number of Games by Log of Estimated Owners",
y = "Number of Games",
x = "Estimated Owners"
) +
scale_x_log10()
The graph shows that the overwhelming majority of games on Steam have a small player base, with most falling in the lowest ownership bracket (0 to 20,000 estimated owners, plotted at its midpoint of 10,000). This suggests that a small player base is typical for smaller game studios on Steam regardless of genre.
# Splitting the values from the 'Genres' column
Genres <- data.frame(
Genres = Steam_Games$Genres
)
Genres_Split <- Genres %>%
separate_longer_delim(Genres, delim = ",")
Genres_Split
| Genres <chr> |
|---|
| Casual |
| Indie |
| Sports |
| Action |
| Indie |
| Action |
| Adventure |
| Indie |
| Strategy |
| Adventure |
# Find the count for each Genre
Genres_Count <- Genres_Split %>%
count(Genres, sort = TRUE)
# Find top 8 genre
Top_8_Genres <- Genres_Count %>%
slice_head(n = 8)
# Print top 8
print(Top_8_Genres)
| Genres <chr> | n <int> |
|---|---|
| Indie | 72080 |
| Casual | 44412 |
| Action | 42037 |
| Adventure | 40166 |
| Simulation | 21109 |
| Strategy | 19947 |
| RPG | 18796 |
| Early Access | 13704 |
Top_8_Vec <- c("Indie", "Casual", "Action", "Adventure", "Simulation", "Strategy", "RPG", "Early Access")
# Keep every row whose genre is one of the top 8.
# %in% tests membership; == would recycle Top_8_Vec and compare position by position.
Genres_Filter <- Genres_Split %>%
  filter(Genres %in% Top_8_Vec)
print(Genres_Filter)
# Summarize genre counts and calculate percentages
Genres_Filter <- Genres_Filter %>%
count(Genres) %>%
mutate(Percentage = round((n / sum(n)) * 100, 1))
# Pie chart with percentage labels
ggplot(Genres_Filter, aes(x = "", y = n, fill = Genres)) +
geom_bar(stat = "identity", color = "white") +
coord_polar(theta = "y") +
geom_text(aes(label = paste0(Percentage, "%")),
position = position_stack(vjust = 0.5),
color = "white", size = 3.5) +
labs(
title = "Most Common Genres on Steam (Pie Chart)",
x = NULL,
y = NULL,
fill = "Genre"
) +
theme_void() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "right"
) +
scale_fill_manual(values = c("cadetblue2","cadetblue3", "cadetblue","cadetblue4","mediumseagreen", "seagreen","green4","darkgreen"))
The above graph shows how games on Steam are distributed across genre tags. It includes only the top 8 genres because these tags are used by far the most. Indie is the most common tag, which suggests that most games on Steam are made by independent developers. Additionally, many games use the Casual tag.
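A pie chart built from tag counts shows each tag's share of all tag occurrences, which is not the same as the share of games carrying that tag, because one game can hold several tags. The sketch below works this through with the top-8 counts reported earlier and the 109,780 rows reported by `dim()`. The names `genre_counts`, `share_of_tags` and `share_of_games` are illustrative, and the second calculation assumes each game carries a given tag at most once.

```r
# Counts copied from the top-8 genre table
genre_counts <- c(
  Indie = 72080, Casual = 44412, Action = 42037, Adventure = 40166,
  Simulation = 21109, Strategy = 19947, RPG = 18796, `Early Access` = 13704
)

# Share of each tag among all top-8 tag occurrences (what a pie of counts shows)
share_of_tags <- round(genre_counts / sum(genre_counts) * 100, 1)

# Share of all games that carry each tag (assumes one occurrence per game)
total_games <- 109780
share_of_games <- round(genre_counts / total_games * 100, 1)

print(share_of_tags)
print(share_of_games)
```

The two views answer different questions, so it helps to state which one a chart is showing.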
The ‘Preferred_Games’ vector contains games that I find to be engaging and enjoyable, with the goal of finding which genre I lean towards the most. Every genre is different, so choosing the genre I know the most about will help me identify key aspects that make those games stand out.
# Preferred Games
# Check Names (Not checked)
Preferred_Games <- c("Cinnabunny", "Potion Craft: Alchemist Simulator", "Stardew Valley", "Cities: Skylines", "Good Pizza, Great Pizza - Cooking Simulator Game", "Papa's Freezeria Deluxe", "Sticky Business", "Slime Rancher 2", "Albion Online", "Squirreled Away")
#grab games
Favorite_Games <- Steam_Games %>%
filter(Name %in% Preferred_Games)
#print favorite games
Favorite_Games
| AppID <int> | Name <chr> | Release.date <chr> | Min.estimated.owners <dbl> |
|---|---|---|---|
| 255710 | Cities: Skylines | 10-Mar-15 | 5e+06 |
| 761890 | Albion Online | 16-May-18 | 2e+06 |
| 770810 | Good Pizza, Great Pizza - Cooking Simulator Game | 5-Jun-18 | 5e+04 |
| 1210320 | Potion Craft: Alchemist Simulator | 21-Sep-21 | 5e+05 |
| 413150 | Stardew Valley | 26-Feb-16 | 1e+07 |
| 1657630 | Slime Rancher 2 | 22-Sep-22 | 2e+05 |
| 2291760 | Papa's Freezeria Deluxe | 31-Mar-23 | 2e+04 |
| 2303350 | Sticky Business | 17-Jul-23 | 2e+04 |
| 2794830 | Cinnabunny | 19-Feb-25 | 0e+00 |
| 2977620 | Squirreled Away | 28-Mar-25 | 0e+00 |
# Find the top 250 games in the Steam library
Top_250_Games <- Steam_Games %>%
arrange(desc(Owners.avg)) %>%
head(250)
correlation_matrix_250 <- Top_250_Games %>%
select(
Owners.avg, Price, Positive, Negative,
Achievements, Recommendations,
Average.playtime.two.weeks, Average.playtime.forever) %>%
mutate(across(everything(), as.numeric)) %>% # converts all columns to the numeric datatype
cor(use = "pairwise.complete.obs") # handles missing values
correlation_matrix_250
Owners.avg Price Positive Negative Achievements Recommendations Average.playtime.two.weeks
Owners.avg 1.0000000 -0.10669394 0.60153627 0.59286065 0.10950162 0.47075473 0.23329474
Price -0.1066939 1.00000000 -0.02363854 -0.05373682 0.09503967 0.06664440 -0.03616313
Positive 0.6015363 -0.02363854 1.00000000 0.75897948 0.11084977 0.91596727 0.10686957
Negative 0.5928606 -0.05373682 0.75897948 1.00000000 0.04239992 0.79744867 0.13494968
Achievements 0.1095016 0.09503967 0.11084977 0.04239992 1.00000000 0.08891216 0.05274189
Recommendations 0.4707547 0.06664440 0.91596727 0.79744867 0.08891216 1.00000000 0.06745812
Average.playtime.two.weeks 0.2332947 -0.03616313 0.10686957 0.13494968 0.05274189 0.06745812 1.00000000
Average.playtime.forever 0.6193225 0.04583561 0.62775208 0.63432460 0.16631237 0.55232036 0.22612711
Average.playtime.forever
Owners.avg 0.61932248
Price 0.04583561
Positive 0.62775208
Negative 0.63432460
Achievements 0.16631237
Recommendations 0.55232036
Average.playtime.two.weeks 0.22612711
Average.playtime.forever 1.00000000
Based on this correlation matrix, I would like to focus on three aspects for further analysis: Owners.avg, player engagement of all kinds (Positive, Negative, and Recommendations), and Price. Owners.avg shows the strongest relationships overall, correlating at roughly 0.47 to 0.62 with the review counts, Recommendations, and Average.playtime.forever, so player engagement appears to move together with how many copies of a game are owned. Price does not have any strong correlations; its relationships are weak and mostly slightly negative (about -0.11 with Owners.avg), with a weak positive relationship to Average.playtime.forever (about 0.05).
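To make the comparison easier to read, the Owners.avg row of the matrix can be ranked by absolute strength. The sketch below copies the values from the printed matrix into a named vector (`owners_cor` is an illustrative name); in a live session the same numbers could presumably be taken from `correlation_matrix_250["Owners.avg", ]` instead.

```r
# Correlations with Owners.avg, copied from the printed matrix
owners_cor <- c(
  Price = -0.1066939, Positive = 0.6015363, Negative = 0.5928606,
  Achievements = 0.1095016, Recommendations = 0.4707547,
  Average.playtime.two.weeks = 0.2332947, Average.playtime.forever = 0.6193225
)

# Rank by absolute value so negative relationships are judged by strength, not sign
ranked <- owners_cor[order(abs(owners_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Ranking by absolute value keeps a strong negative correlation from being mistaken for a weak one, although here the only negative entry (Price) is also the weakest.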
For my next series of analyses, I chose to focus on the top 250 games overall and within each genre, ranked by ownership. Because Steam is such a large platform, the most common genre is not necessarily the most successful, as shown by the player count bar graph.
#Now do the same steps from finding the Top_250_Games to find the 250 most popular games within the simulation genre
#Make a subset:
#Subset of simulation games
Sim_Games <- Steam_Games[grepl("Simulation", Steam_Games$Genres), ]
#Find the top 250 from sim games
Top_250_Sims <- Sim_Games %>%
arrange(desc(Owners.avg)) %>%
head(250)
Top_250_Sims
| | AppID <int> | Name <chr> | Release.date <chr> |
|---|---|---|---|
| 1 | 4000 | Garry's Mod | 29-Nov-06 |
| 2 | 236390 | War Thunder | 15-Aug-13 |
| 3 | 1468810 | 鬼谷八è\u008d’ Tale of Immortal | 27-Jan-21 |
| 4 | 242760 | The Forest | 30-Apr-18 |
| 5 | 261550 | Mount & Blade II: Bannerlord | 30-Mar-20 |
| 6 | 552990 | World of Warships | 15-Nov-17 |
| 7 | 417910 | Street Warriors Online | 16-Dec-16 |
| 8 | 291480 | Warface | 1-Jul-14 |
| 9 | 301520 | Robocraft | 24-Aug-17 |
| 10 | 477160 | Human: Fall Flat | 22-Jul-16 |
Top_250_Games
| | AppID <int> | Name <chr> | Release.date <chr> | Min.estimated.owners <dbl> |
|---|---|---|---|---|
| 1 | 570 | Dota 2 | 9-Jul-13 | 1e+08 |
| 2 | 1063730 | New World | 28-Sep-21 | 5e+07 |
| 3 | 578080 | PUBG: BATTLEGROUNDS | 21-Dec-17 | 5e+07 |
| 4 | 440 | Team Fortress 2 | 10-Oct-07 | 5e+07 |
| 5 | 730 | Counter-Strike: Global Offensive | 21-Aug-12 | 5e+07 |
| 6 | 2358720 | Black Myth: Wukong | 19-Aug-24 | 5e+07 |
| 7 | 1172470 | Apex Legendsâ„¢ | 4-Nov-20 | 2e+07 |
| 8 | 4000 | Garry's Mod | 29-Nov-06 | 2e+07 |
| 9 | 1085660 | Destiny 2 | 1-Oct-19 | 2e+07 |
| 10 | 359550 | Tom Clancy's Rainbow Six® Siege | 1-Dec-15 | 2e+07 |
# Now that we have our broad sets, create a grouped bar chart to compare player bases
# Keep only the average estimated owners and the name in each set
Top_250_Games_MO <- Top_250_Games %>%
  select(Owners.avg, Name)
Top_250_Sims_MO <- Top_250_Sims %>%
  select(Owners.avg, Name)
# Label each set so both can live in one data frame
Top_250_Games_MO$Source <- "General"
Top_250_Sims_MO$Source <- "Simulations"
Top_Compare_250 <- rbind(Top_250_Games_MO, Top_250_Sims_MO) # rbind stacks the rows of the two data frames
# Summarize the data to get the counts
summary_data <- Top_Compare_250 %>%
  group_by(Source) %>%
  summarise(Count = n()) # n() counts the number of rows in each group
# 1. Create 20 equal-width bins of estimated owners
Top_Compare_250$Owner_Bins <- cut(Top_Compare_250$Owners.avg, breaks = 20)
# 2. Plot the data using the new bins
ggplot(Top_Compare_250, aes(x = Owner_Bins, fill = Source)) +
geom_bar(position = "dodge", color = "cadetblue4") +
labs(
title = "Simulation and Top 250 Games Comparison by Estimated Owners",
x = "Estimated Owners",
y = "Count of Games",
fill = "Data Source"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_fill_manual(values = c("cadetblue3", "darkgrey"))
The above graph shows that, compared to the top 250 games on Steam, more of the top simulation games sit in the lower ownership bins. The wider bars also show an interesting aspect of the simulation genre as a whole: most of the top 250 simulation games don’t reach the same player counts as games in the action genre might. Additionally, the graph shows that only a few games sit at the very top of Steam’s roster.
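How the bins are cut affects how such a chart reads. The sketch below uses a few made-up owner estimates (`owners` is illustrative) to show that `cut()` with a single number for `breaks` produces equal-width intervals, which places most values of a skewed variable in the first bin, whereas cutting on `log10()` of the values spreads them more evenly. The log-based variant is offered only as a possible alternative, not as the approach used in the analysis.

```r
# Illustrative owner estimates spanning two orders of magnitude
owners <- c(10000, 35000, 75000, 150000, 350000, 750000)

# Three equal-width intervals across the raw range
equal_width <- cut(owners, breaks = 3)
print(table(equal_width))

# Three equal-width intervals on the log10 scale
log_width <- cut(log10(owners), breaks = 3)
print(table(log_width))
```

On these values the raw-scale bins hold four of the six games in the lowest interval, while the log-scale bins hold two each.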
Visualization of previous correlation analysis between player engagement and estimated ownership
# create percent positive reviews
Top_250_Games <- Top_250_Games %>%
mutate(Positive_PE = (Positive / (Positive + Negative)))
# Making the scatter plot with a log-transformed X-axis
ggplot(data = Top_250_Games, aes(x = Owners.avg, y = Positive_PE)) +
geom_point(color = "grey2", alpha = 0.05, size = 2) +
# Add the linear trendline using geom_smooth()
geom_smooth(method = "lm", se = FALSE, color = "cadetblue4") +
scale_x_log10() +
labs(title = "Top 250 Games Positive Ratings %",
x = "Estimated Owners (Log Scale)",
y = "Percent Positive") +
theme_minimal()
The above graph shows the share of positive ratings for each game against its estimated owners. The games in this plot are part of the top 250 games on Steam. Games tend to have a positive rating of around 85%, with the percentage decreasing slightly as estimated ownership increases and fewer games remain.
#Mutate whole dataset
Steam_Games <- Steam_Games %>%
mutate(Positive_PE = (Positive / (Positive + Negative)))
# Making the scatter plot with a log-transformed X-axis
ggplot(data = Steam_Games, aes(x = Owners.avg, y = Positive_PE)) +
geom_point(color = "mediumseagreen", alpha = 0.05, size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "darkgreen") + # Adds a linear trendline
scale_x_log10() +
labs(title = "Games Positive Ratings % ",
x = "Estimated Owners (Log Scale)",
y = "Percent Positive Ratio") +
theme_minimal()
This graph shows a similar picture to the previous graph, but here the trendline shows a slight increase from about 75%. Additionally, as the number of owners increases, the spread of ratings narrows. This data is drawn from the whole data set rather than a subset like the previous graph.
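One detail worth keeping in mind for the full data set is that many games have no reviews at all (the summary shows a first quartile of 0 for both Positive and Negative), and for those games the ratio is 0 divided by 0. The sketch below, on made-up counts, shows what R returns in that case and one possible way to mark such games explicitly as missing; the names `ratio_raw` and `ratio_safe` are illustrative.

```r
# Illustrative review counts; the second game has no reviews
positive <- c(90, 0, 3)
negative <- c(10, 0, 1)
total <- positive + negative

# Plain division: 0 / 0 yields NaN for the game without reviews
ratio_raw <- positive / total

# Explicit handling: games without reviews get NA instead of NaN
ratio_safe <- ifelse(total > 0, positive / total, NA_real_)

print(ratio_raw)
print(ratio_safe)
```

ggplot2 drops rows with missing or non-finite values and emits a warning, so the plotted points and the trendline describe only games that have at least one review.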
# Find what percent of the top 250 games contain Action in their Genres column (there are 250 rows in this set)
# In Top_250_Games, use grepl to find all games tagged Action, then save them to Action_Games_from_250
Action_Games_from_250 <- Top_250_Games[grepl("Action", Top_250_Games$Genres), ]
# Find percent
Percent_Action <- (nrow(Action_Games_from_250) / 250) * 100
Percent_Action
[1] 70.8
Based on the above code, games that use the Action tag make up 70.8% of the top 250 games. Because the data set was ranked by estimated ownership, this means that 70.8% of the most-owned games fall under the Action genre, making Action the most popular genre on Steam by ownership.
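The percentage above divides by a hard-coded 250. A small variation, sketched here on a made-up data frame (`toy_top` is illustrative), divides by `nrow()` of the subset instead, so the result stays correct if the subset ever holds fewer rows than expected.

```r
# Illustrative subset of four games
toy_top <- data.frame(
  Genres = c("Action,Free to Play", "Simulation", "Action,Indie", "RPG"),
  stringsAsFactors = FALSE
)

# Rows whose Genres string contains the Action tag
action_rows <- toy_top[grepl("Action", toy_top$Genres, fixed = TRUE), , drop = FALSE]

# Divide by the actual number of rows rather than a fixed constant
percent_action <- nrow(action_rows) / nrow(toy_top) * 100
print(percent_action)
```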
#For the next bit of code involving pie charts, I am finding the top 250 action games so I may compare a very successful genre to the simulation genre.
Top_250_Action <- Steam_Games %>%
filter(grepl("Action", Genres)) %>%
arrange(desc(Owners.avg)) %>%
head(250)
# Print the result
Top_250_Action
| | AppID <int> | Name <chr> | Release.date <chr> | Min.estimated.owners <dbl> |
|---|---|---|---|---|
| 1 | 570 | Dota 2 | 9-Jul-13 | 1e+08 |
| 2 | 1063730 | New World | 28-Sep-21 | 5e+07 |
| 3 | 578080 | PUBG: BATTLEGROUNDS | 21-Dec-17 | 5e+07 |
| 4 | 440 | Team Fortress 2 | 10-Oct-07 | 5e+07 |
| 5 | 730 | Counter-Strike: Global Offensive | 21-Aug-12 | 5e+07 |
| 6 | 2358720 | Black Myth: Wukong | 19-Aug-24 | 5e+07 |
| 7 | 1172470 | Apex Legendsâ„¢ | 4-Nov-20 | 2e+07 |
| 8 | 1085660 | Destiny 2 | 1-Oct-19 | 2e+07 |
| 9 | 359550 | Tom Clancy's Rainbow Six® Siege | 1-Dec-15 | 2e+07 |
| 10 | 236390 | War Thunder | 15-Aug-13 | 2e+07 |
Narrowing my scope, I decided to pull the top 250 action games from the Steam library so I can compare the simulation genre against the most popular genre. As with the Simulation subset, games on Steam are allowed more than one genre tag, so a game tagged with both Action and Simulation will appear in both sets.
# Free-to-play pie chart for the overall top 250
# 1. Use grepl to find all games in Top_250_Games tagged "Free to Play"
F2P_Games_from_250 <- Top_250_Games[grepl("Free to Play", Top_250_Games$Genres), ]
# 2. Find the percentages
Percent_F2P <- round((nrow(F2P_Games_from_250) / 250) * 100, 2)
Not_Free <- 100 - Percent_F2P
# 3. Collect the slice values
counts <- c(Percent_F2P, Not_Free)
# 4. Define the labels for the slices (character vector)
labels <- c("Free to Play", "Not Free to Play")
# 5. Define the colors
colors <- c("gray", "darkgray")
# 6. Generate the pie chart using the 'pie()' function
F2P_Pie_250 <- pie(
x = counts, # The numeric values
labels = labels, # The text labels for each slice
main = "Percent Free to Play (Top 250)", # Main title
col = colors # Colors for the slices
)
# Free-to-play pie chart for the top simulation games
# 1. Use grepl to find all games in Top_250_Sims tagged "Free to Play"
F2P_Games_from_Sim <- Top_250_Sims[grepl("Free to Play", Top_250_Sims$Genres), ]
# 2. Find the percentages
Percent_F2P2 <- round((nrow(F2P_Games_from_Sim) / 250) * 100, 2)
Not_Free2 <- 100 - Percent_F2P2
# 3. Collect the slice values
counts <- c(Percent_F2P2, Not_Free2)
# 4. Define the labels for the slices (character vector)
labels <- c("Free to Play", "Not Free to Play")
# 5. Define the colors
colors <- c("cadetblue3", "cadetblue")
# 6. Generate the pie chart using the 'pie()' function
F2P_Pie_Sims <- pie(
x = counts, # The numeric values
labels = labels, # The text labels for each slice
main = "Percent Free to Play (Top Sims)", # Main title
col = colors # Colors for the slices
)
F2P_Pie_250
NULL
F2P_Pie_Sims
NULL
# Make the action free-to-play pie chart
# 1. Subset the top 250 action games to those tagged "Free to Play"
F2P_Games_from_Action <- Top_250_Action[grepl("Free to Play", Top_250_Action$Genres), ]
# 2. Compute the percentage of free-to-play and paid games
Percent_F2P3 <- round((nrow(F2P_Games_from_Action) / 250) * 100, 2)
Not_Free3 <- 100 - Percent_F2P3
# 3. Combine the two percentages into the slice values
counts <- c(Percent_F2P3, Not_Free3)
# 4. Define the labels for the slices (character vector)
labels <- c("Free to Play", "Not Free to Play")
# 5. Define the colors
colors <- c("mediumseagreen", "darkgreen")
# 6. Draw the pie chart; pie() returns NULL invisibly, so F2P_Pie_Action holds NULL
F2P_Pie_Action <- pie(
  x = counts, # The numeric values
  labels = labels, # The text labels for each slice
  main = "Percent Free to Play (Top Action)", # Main title
  col = colors # Colors for the slices
)
F2P_Pie_250
NULL
F2P_Pie_Sims
NULL
F2P_Pie_Action
NULL
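Despite lower priced games being more accessible, the majority of games in each category are not free to play: paid games outnumber free-to-play games by roughly 2.7 to 1 in the top 250, 2.8 to 1 among the top action games, and 4.6 to 1 among the top simulation games. Simulation games are the least likely to be free to play.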
print("Percent Free (250, Sim, Action) ")
[1] "Percent Free (250, Sim, Action) "
Percent_F2P
[1] 26.8
Percent_F2P2
[1] 18
Percent_F2P3
[1] 26
print("Percent Not Free (250, Sim, Action) ")
[1] "Percent Not Free (250, Sim, Action) "
Not_Free
[1] 73.2
Not_Free2
[1] 82
Not_Free3
[1] 74
The above code shows the exact percentages for each of the pie charts.
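The three pie chart chunks repeat the same steps, so the percentage calculation could be wrapped in one helper. The following is a minimal sketch, not the code used for this report: `percent_f2p()` and the toy `demo_games` data frame are hypothetical names introduced for illustration, and the helper divides by the number of rows actually present rather than a hard-coded 250.

```r
# Hypothetical helper: percentage of rows whose genre string contains "Free to Play".
percent_f2p <- function(games, genre_col = "Genres") {
  is_free <- grepl("Free to Play", games[[genre_col]], fixed = TRUE)
  round(100 * mean(is_free), 2)
}

# Toy data, invented for illustration only (not taken from the Kaggle data set).
demo_games <- data.frame(
  Name   = c("Game A", "Game B", "Game C", "Game D"),
  Genres = c("Action,Free to Play", "Simulation", "Action,Simulation", "Indie"),
  stringsAsFactors = FALSE
)

pct_free <- percent_f2p(demo_games)  # one of the four toy rows is free to play
pct_paid <- 100 - pct_free
c(free = pct_free, paid = pct_paid)
```

With the objects used in this report, the equivalent call would be `percent_f2p(Top_250_Sims)`, and `pct_paid / pct_free` gives how many times more common paid games are than free-to-play games.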
price_df <- data.frame(
Games_Price = Top_250_Games$Price,
Sims_Price = Top_250_Sims$Price,
Action_Price = Top_250_Action$Price
)
summary(price_df)
Games_Price Sims_Price Action_Price
Min. : 0.00 Min. : 0.000 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 2.115 1st Qu.: 0.00
Median : 9.99 Median :14.990 Median : 9.99
Mean :15.89 Mean :16.272 Mean :16.54
3rd Qu.:24.99 3rd Qu.:24.990 3rd Qu.:24.99
Max. :69.99 Max. :59.990 Max. :59.99
This summary gives the minimum, quartiles, median, mean, and maximum price for each of the three groups.
# Convert the data frame to a long format
price_long <- pivot_longer(price_df,
cols = c(Games_Price, Sims_Price, Action_Price),
names_to = "Game_Type",
values_to = "Price")
# Create the box and whisker plot using ggplot2
ggplot(price_long, aes(x = Game_Type, y = Price, fill = Game_Type)) +
geom_boxplot() +
labs(title = "Price Distribution Across Game Types",
y = "Price",
x = "Game Type") +
scale_fill_manual(values = c("mediumseagreen", "grey", "cadetblue3")) +
theme_minimal() +
stat_summary(fun = "median",
geom = "text",
aes(label = round(after_stat(y), 2)), # Rounds the median value (2 decimals for currency)
vjust = -0.5, # Adjusts vertical position to be slightly above the median line
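color = "white")
The graph shows that simulation games tend to be priced higher than other popular games, with a median of 14.99 against 9.99 for both the top 250 and the top action games. However, the single most expensive game (69.99) appears in the overall top 250 rather than in either genre subset.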
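The medians labelled on the box plot can also be tabulated directly from long-format data. Below is a minimal base R sketch that uses a small invented `demo_prices` data frame in place of `price_long`; the values are illustrative and do not come from the data set.

```r
# Toy long-format prices, invented for illustration only.
demo_prices <- data.frame(
  Game_Type = c("Games_Price", "Games_Price", "Games_Price",
                "Sims_Price", "Sims_Price", "Sims_Price"),
  Price     = c(0, 9.99, 24.99, 4.99, 14.99, 24.99)
)

# One row per game type with the median price of that type.
median_by_type <- aggregate(Price ~ Game_Type, data = demo_prices, FUN = median)
median_by_type
```

Building the data in long format from the start also avoids the requirement of `data.frame()` that every column has the same number of rows, which matters if a genre subset ever holds fewer than 250 games.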
# The top 250 has a noticeably higher maximum price than the action subset, so check which game(s) in the top 250 cost the most.
Most_Expensive_250 <- Top_250_Games[Top_250_Games$Price == max(price_df$Games_Price), ]
print(Most_Expensive_250)
| | AppID <int> | Name <chr> | Release.date <chr> | Min.estimated.owners <dbl> | Max.estimated.owners <dbl> | Price <dbl> |
|---|---|---|---|---|---|---|
| 155 | 1716740 | Starfield | 5-Sep-23 | 5e+06 | 1e+07 | 69.99 |
‘Starfield’ is a game by a major studio, Bethesda. It is an open-world space simulation with realistic graphics that took around 7 years to develop.
Regarding the first main question, the first P of the 4P’s is place, which in this case is Steam. The analysis begins by looking at the distribution of genre tags among all games on Steam and then counting games by their ownership. Estimated ownership is the best available metric for success in my data set, because quality is a complex subject and the data contains no direct measure of it. I then used this variable to look into the second P, product. Here the product is the genre, because games within a genre tend to share characteristics and target audiences, which makes genre a good way to separate one product from another. For this question, the estimated owners of each genre are compared.
The third P, price, was an interesting aspect. First, the percentage of free-to-play games among the top 250 games, the top simulation games, and the top action games is compared to decide whether a developer in the simulation genre could expect a free-to-play game to perform better. After this, the prices for each game type are analyzed with a box-and-whisker plot. Although the fourth P, promotion, is important, it is not covered as directly as the other three. Funds spent on promotion are not available in this data set, so the other three Ps are the focus, and promotion is considered more holistically under the assumption that game ownership is itself a form of promotion.
The analysis is primarily bivariate, because most of it compares two variables at a time. The working assumption is that the importance of the 4P’s marketing method on Steam can be measured by looking at estimated game ownership. Games are supported by their communities, and games with strong communities tend to be recommended by their players to other potential players and to be played on streaming sites, both of which build interest in the game and strengthen the community further.
Throughout the analysis it is important to keep in mind that different genres cater to different interests and fill different niches, meaning there is no perfect game. The goal is to understand what Steam looks like for the majority of games on the platform and how expectations differ between genres regarding price and ownership. To find out what makes a game successful, I first ran a correlation analysis to check whether a strong community is associated with increases in most variables; this is what led me to sort by ownership. The action genre was chosen as a comparison alongside the top 250 games in order to learn what the majority of consumers on Steam prefer and to use those variables as a basis for comparison. With this method, the plan is to see whether there are differences between genres on Steam, following the product quarter of the 4P’s.
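A correlation check of this kind could look something like the sketch below. It is an assumption about the approach, not the code used for this report: the column names `Owners`, `Peak_Players`, and `Recommendations` and all values are invented for illustration, and Spearman's rank correlation is used here only because ownership is heavily skewed.

```r
# Toy data, invented for illustration only (hypothetical column names).
demo <- data.frame(
  Owners          = c(1e3, 5e3, 2e4, 1e5, 5e5, 2e6, 5e7),
  Peak_Players    = c(2, 10, 35, 400, 900, 12000, 800000),
  Recommendations = c(0, 12, 40, 900, 700, 30000, 1500000)
)

# Rank-based correlation matrix; read the "Owners" row to see how
# strongly each variable moves with estimated ownership.
owner_cor <- cor(demo, method = "spearman")
round(owner_cor["Owners", ], 2)
```

A rank-based measure is one reasonable choice when a few games have owner counts thousands of times larger than the rest; Pearson correlation on log-transformed counts would be another.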
The simulation genre as a whole is popular and has aspects that appeal to different types of players. These games often include a high level of player freedom, which can make them more expensive to produce. Among the top 250 simulation games, the median price is $14.99 and only 18% are free to play. Simulation games tend to be complex and have long development times, which drives prices up. Even so, games in the top 250 generally do not reach Nintendo’s typical price: the most expensive game in the top 250 costs $69.99 and, despite its high price tag, appears at position 155 of 250 in the ownership ranking, above only about 38% of the top games.
There is a strong correlation between player engagement and the ownership of a game. Unexpectedly, action and simulation games tend to have similar ownership. This surprised me, because I had postulated that players would spend more time on games with story-based aspects. What I found instead is that players put time into whatever interests them, whether the game is full of combat or is a slow-paced farming game. Therefore, for a developer who would like to make a game, building a strong community is the key to increasing ownership.
The Steam platform as a whole offers a wide variety of games and leans toward indie developers, because it is used to publish games rather than to present a curated selection as the Nintendo store does. The majority of games on the platform see an estimated 1,000 copies sold, while a few reach estimated owners in the tens of millions. Steam responds best to the action genre, and there are significant differences between the action and simulation genres. A developer looking to distribute a simulation game can expect that a successful simulation game, judged by ownership, will look different from a successful action game.
Regarding price, the data shows that the typical price differs by genre: action games have a median price of about $10 and simulation games a median of $14.99. This means that a simulation game priced higher than the typical popular game would not be out of the ordinary. Additionally, a game being free to play does not always lead to higher ownership. Overall, because the percentage of positive reviews is rather consistent across all levels of ownership, success is largely about building a strong community. Marketing can be a great boost, but a game will tend to reach its intended audience regardless; the level of ownership the developer is aiming for determines how much effort and time should be put into marketing. Ultimately, a good game is not measured only by its ownership.
Limitations of the data set include the lack of total game revenue and of a definite measure of game quality. Additionally, the market the data comes from is not stated. If that variable were available, it would be interesting to compare Steam’s stores across countries, for example by comparing how genres perform in each country.
Going Indie Biz. “Steam Marketing Expert Reveals Algorithm Secrets.” YouTube, 27 July 2024, www.youtube.com/watch?v=CfSThW3GwvM.
Summer. “The Four Ps of Marketing: Definition, Role, and Questions to Consider.” Mageplaza, 17 Mar. 2025, www.mageplaza.com/blog/4-ps-of-marketing.html.
FronkonGames. “Steam Games Dataset.” Kaggle, 24 Aug. 2023, www.kaggle.com/datasets/fronkongames/steam-games-dataset.
“Simulation video game.” Wikipedia, 3 Nov. 2024, en.wikipedia.org/wiki/Simulation_video_game.
“Starfield (video game).” Wikipedia, 6 Nov. 2024, en.wikipedia.org/wiki/Starfield_(video_game).
“Steam Charts: Top 100.” Steam, store.steampowered.com/charts/. Accessed 9 Nov. 2025.
“Store Home.” Steam, store.steampowered.com/games. Accessed 9 Nov. 2025.
“Top questions.” Stack Overflow, stackoverflow.com/questions. Accessed 9 Nov. 2025.
“Why Are Games So Expensive Now? Factors Driving Up Game Prices.” DHgate Smart Blog, smart.dhgate.com/why-are-games-so-expensive-now-factors-driving-up-game-prices/. Accessed 9 Nov. 2025.